Overview

This notebook aims to assess our ability to call genomic variants accurately by comparing replicate sequencing runs.

The samples consist of 24 sequencing runs with a Ct value of less than 20. All samples were originally sequenced once as part of this study; the runs with a Ct value low enough to be considered for re-sequencing by a shotgun metagenomic method were then re-sequenced as replicates. One sample is currently excluded from this analysis because it is not available on the SRA (SpID: 10110), leaving 23 final samples.

By comparing the allele frequency of variants called in both replicates, we hope to assess the accuracy, the optimal frequency cutoff, and the optimal coverage cutoff for downstream bottleneck analyses.

Replicate Information

The critical difference between the replicates is the processing. The samples on the SRA underwent a slightly different pre-processing pipeline to remove human reads and trim adaptors. Eventually, I’ll re-process the raw files to make the pipelines identical.

Variant Calling

For this preliminary analysis, I identified variants by two methods. First, I aligned the samples with BWA and called variants with Varscan (I removed any filters for minimum allele frequency, depth, or supporting reads). Second, using Samtools mpileup, I tallied each base at every position in the genome, along with its location on either the forward or reverse strand.
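The per-base, per-strand tally from the second method can be sketched as a small parser over column 5 of the mpileup output, where `.`/`,` are reference matches on the forward/reverse strand and uppercase/lowercase letters are forward/reverse mismatches. This is a minimal illustration, not the exact code used here; it handles the common markers (`^`, `$`, indels) and ignores deletion placeholders.

```python
# Minimal sketch: tally (base, strand) counts from a samtools mpileup
# read-base string (column 5). Assumes a well-formed pileup string.
from collections import Counter

def count_bases(ref_base, read_bases):
    """Return a Counter keyed by (base, strand) for one pileup position."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == '^':            # start-of-read marker; next char is mapping quality
            i += 2
            continue
        if c == '$':            # end-of-read marker
            i += 1
            continue
        if c in '+-':           # indel: sign, length, then that many sequence chars
            i += 1
            num = ''
            while i < len(read_bases) and read_bases[i].isdigit():
                num += read_bases[i]
                i += 1
            i += int(num)       # skip the inserted/deleted bases
            continue
        if c == '.':
            counts[(ref_base.upper(), '+')] += 1
        elif c == ',':
            counts[(ref_base.upper(), '-')] += 1
        elif c.isupper():
            counts[(c, '+')] += 1
        elif c.islower():
            counts[(c.upper(), '-')] += 1
        i += 1                  # '*' (deletion placeholder) falls through uncounted
    return counts
```

For example, `count_bases('A', '..,,Tt^].$')` reports three forward and two reverse reference A's plus one T on each strand.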

Depth of Coverage

I think that one of the limiting factors in this analysis is the lack of coverage and the disparity in coverage between replicates. I’m not sure if this difference is due to pre-processing or random variation from re-sequencing.

Below, I’m comparing the difference in the number (and %) of mapped reads between replicates. There are some substantial differences between replicates.

There is also a pronounced disparity in the patterns of coverage over the genome between replicates. Below, I’ve plotted the depth of coverage in 50 bp bins over the genome. I filtered out the last 53 bp (2 bins) of the genome because a poly-A tract there attracts a tremendous number of aligned reads, which skews the visualization. I will figure out a way to handle these alignments.
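The binning step can be sketched as follows. The bin width and the 53 bp tail trim mirror the text; the per-position depth vector (e.g., from `samtools depth`) is an assumed input.

```python
# Sketch: mean depth per fixed-width bin, trimming the trailing poly-A
# tract. depths[i] is the coverage at position i+1 of the genome.
def bin_depth(depths, bin_size=50, trim_tail=53):
    """Return mean depth per bin after dropping the last trim_tail positions."""
    kept = depths[:-trim_tail] if trim_tail else depths
    bins = []
    for start in range(0, len(kept), bin_size):
        chunk = kept[start:start + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins
```

A toy genome of 153 positions at uniform depth 10 trims to 100 positions and yields two bins of mean depth 10.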

Also, interestingly, there is a clear pattern of coverage over the genome in the second replicate (blue) that doesn’t seem so evident in the first. I’m not sure what to make of this yet. Perhaps this pattern is a consequence of using a different pre-processing pipeline than the SRA replicates.

Consensus Variants

Next, I wanted to identify and remove all consensus variants shared between the samples. Optimally, I could do this in the analysis pipeline upstream of the final alignment. However, by identifying consensus variants at this stage, I can check for any systematic biases or samples that might skew the cohort’s consensus sequence by having a replicate with too little depth over a region.

To identify consensus variants, I filtered for any variants that appear at greater than or equal to 50% frequency in either replicate and counted the number of samples with these variants.
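The rule above can be sketched directly. The input layout is an assumption for illustration: `freqs[sample][snp]` holds the allele frequency pair from the two replicates.

```python
# Sketch of the consensus rule: a SNP counts for a sample if it reaches
# >= 50% frequency in either replicate; tally how many samples carry it.
from collections import Counter

def consensus_counts(freqs, cutoff=0.5):
    counts = Counter()
    for sample, snps in freqs.items():
        for snp, (f1, f2) in snps.items():
            if f1 >= cutoff or f2 >= cutoff:
                counts[snp] += 1
    return counts
```

With two samples where C241T is near-fixed in both but A100G passes 50% in only one replicate of one sample, C241T is counted twice and A100G once.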

There are two SNPs (indicated in red) that show up in all 23 samples. There are another ~7 SNPs that are likely consensus variants but are missing from either one or two samples (or replicates).

I was curious if these SNPs present in 21 or 22 samples were genuinely absent from the remaining samples. I checked if there was a bias in what samples were missing consensus variants.

## [1] "C7564T-SRR11939986"
## [1] "G11083T-"
## [1] "G28376T-SRR11939986"

There are two samples (SRR11939959 and SRR11939986) that are missing three and four variants respectively.

Samples Missing Consensus Variants

Accession    Replicate  Reads Mapped  Percent Mapped  Average Ct
SRR11939958  2          155589        20.817534       19.885
SRR11939958  1          454944        61.758921       19.885
SRR11939959  2          150975        15.068719       19.900
SRR11939959  1          130396        3.861836        19.900
SRR11939986  2          1021770       54.889163       19.025
SRR11939986  1          85900         1.679103        19.025

Samples SRR11939959 and SRR11939986 have at least one replicate with a very low percentage of reads mapped. The following SNPs were excluded because they were missing from a single replicate:

These SNPs do appear to be true consensus SNPs. Two SNPs are genuinely missing from both replicates of SRR11939986 and can’t be easily explained by low coverage:

Therefore, the following 7 SNPs are the true consensus SNPs and should be masked from downstream analyses.

## [1] "A23403G" "C1059T"  "C14408T" "C241T"   "C3037T"  "G25563T" "G11083T"

Number of SNPs

Before I compare the SNPs called in the replicates, I wanted to see how many SNPs were called in each sample.

There are three main takeaways from the above plots:

  1. The pileup file yields significantly more variants shared between replicates than Varscan does.
  2. About seven samples have fewer than five variants shared between replicates.
  3. The number of SNPs called by each method correlates with the number of mapped reads in each replicate.

Replicate Comparison

Next, I wanted to assess the variability in the frequency of alternative alleles called in each replicate. Samples with significant variability likely have low viral template number and can be excluded from downstream analyses. Additionally, I should be able to determine which samples need to be sequenced to higher depth.

First, I plotted the simplest case: all of the SNPs called by Varscan, excluding the seven consensus variants shared by all samples.

Several samples have nearly no variants shared between replicates at frequencies below ~1 (i.e., non-fixed variants). These include the following samples:

I labeled these samples as the “Worst” quality replicates. Below is a table summarizing their depth and Ct.

No Low Frequency Variants

Accession    Replicate  Reads Mapped  Percent Mapped  Average Ct
SRR11939954  2          71229         11.105654       17.760
SRR11939954  1          688885        96.983302       17.760
SRR11939959  2          150975        15.068719       19.900
SRR11939959  1          130396        3.861836        19.900
SRR11939997  2          502378        26.931036       19.845
SRR11939997  1          418272        33.464276       19.845
SRR11939998  2          1861491       38.530957       19.865
SRR11939998  1          213574        8.680307        19.865

All four samples have relatively few mapped reads, and SRR11939959, SRR11939997, and SRR11939998 also have comparatively high Ct values (~19 cycles).

Otherwise, the main takeaway from the plot above is that there are few variants at intermediate frequencies, between low frequency (< 5%) and high frequency (> 90%). Also, as expected, high-frequency variants show good concordance between replicates. The relationship is therefore easier to visualize with a log transformation.
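One practical detail of log-transforming allele frequencies is that variants absent from one replicate (frequency 0) are undefined on a log axis; a common workaround, sketched below with an assumed floor value, is to clamp frequencies to a small pseudocount before transforming.

```python
# Sketch: log10-transform an allele frequency, clamping zeros to a small
# floor so variants missing from one replicate still appear on the plot.
import math

def log_freq(f, floor=1e-4):
    return math.log10(max(f, floor))
```

So a 10% variant maps to -1 and an absent variant sits at the floor (-4) rather than falling off the axis.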

Some samples have better concordance between allele frequencies than others. However, the relationship is easier to visualize with insertions and deletions included.

With this, I can break the samples into either “Good” or “Bad” quality groups. The “Bad” quality group excludes the “Worst” quality samples mentioned above.

Pileup Variants

Finally, I wanted to see if these relationships hold when I include the variants I identified with mpileup. Below, these variants are layered over the variants called by Varscan. The line showing the relationship between allele frequencies is drawn for the Varscan SNPs and is colored by the sample-quality mentioned above.

“Bad” and “Worst” quality replicates do not have concordant allele frequencies. However, this isn’t clear for every sample, so I log-transformed the axes hoping to get a better sense.

There is a pretty significant amount of variability regardless of the concordance between allele frequencies called by Varscan. The alleles annotated in the pileup statistics are possibly too lenient, since I haven’t applied any filters. Looking at the data, I noticed that variants shared between mpileup and Varscan have at least seven supporting reads in each replicate. This number isn’t hardcoded into the variant calling by Varscan, but is probably a consequence of the maximum coverage; from looking at the Varscan manual, that’s the only explanation I have.

I applied the filter that any allele must be supported by at least 7 reads in either replicate. The remaining alleles show pretty good concordance, even the variants only called from the pileup file.
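The supporting-read filter is a one-liner in spirit; the record layout below, `(allele, reads_rep1, reads_rep2)`, is an assumption for illustration.

```python
# Sketch of the filter applied above: keep an allele only if it is backed
# by at least min_reads supporting reads in either replicate.
def filter_alleles(records, min_reads=7):
    return [r for r in records if r[1] >= min_reads or r[2] >= min_reads]
```

For example, an allele with 10 supporting reads in replicate 1 and 3 in replicate 2 survives, while one with 2 and 1 is dropped.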

How to Proceed?

Samples with very few variants fall both into the “Worst” and “Bad” categories. Since these samples have so few variants, they can’t be assessed based on the variability in allele frequency. The only way to remedy this is to sequence more. However, since most of the “Worst” samples have relatively high Ct values, this might be futile. These samples are as follows:

Next are the samples with comparatively many shared variants. Of these, four have a similar number of variants to the “Good” samples, but more variability. These are the samples I assume have low template diversity. These samples are:

Finally, there are the “Good” samples that probably don’t need much more re-sequencing:

The worst samples have low depth and high Ct. The best samples have high depth and low Ct. With a better statistic for variability than manually binning the runs into categories, it should be easier to correlate the depth and Ct with the replicates’ accuracy. Effective depth should accomplish this.

Shared Variants

Using only the “Good” quality replicates, I was curious what the extent of shared variation was between samples. I used variants called by Varscan with coverage > 200X in both replicates and more than 7 supporting reads in both replicates. First, I made a matrix with all of the variants.

Then, I made a matrix with only the minor variants.
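From such a matrix, the per-sample sharing statistic discussed below can be sketched as follows; representing each sample as a set of its passing SNPs is an assumed simplification of the presence/absence matrix.

```python
# Sketch: for each sample, the fraction of its variants that also appear
# in at least one other sample. variants[sample] is a set of SNP labels.
def shared_fraction(variants):
    out = {}
    for sample, snps in variants.items():
        others = set().union(*(v for k, v in variants.items() if k != sample))
        out[sample] = len(snps & others) / len(snps) if snps else 0.0
    return out
```

With toy sets {x, y}, {y, z}, and {q}, the first two samples each share half of their variants and the third shares none.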

The number of variants shared between samples depends strongly on the number of variants the sample starts with. On average, samples share about half of their variants with other samples. However, most of these are low frequency (< 1%).